The author first looked at both wine data sets provided by Udacity. A quick look showed that the two data sets are very similar. Instead of analysing only one of them, the author decided to combine both into one aggregated data set.
The author has always had a strong interest in wine, red or white. An aggregated data set allows the author to explore both types of wine and to see whether they differ. Although the author likes to taste and discuss wine, he never had the opportunity to analyse an aggregated data set with this many laboratory measurements.
The most interesting overall question is: what aspects define a high-quality wine? But the data set offers even more potential for exploration. The author will try to uncover hidden relationships among the variables, as well as differences between the types of wine. Finally, the author will try to build a predictive model to forecast the quality of a wine.
Both data sets were obtained from Cortez et al. (2009).
Since variable 1 (the row number) was not needed, it was dropped from further analysis. In addition, the author added to both data sets a new variable called “type”, which contains either the value “red” or “white”, depending on the data set of origin. The final data set for the analysis contains the following variables:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
13 - type (red or white)
Description of variable (Cortez et al., 2009):
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
13 - type: either red or white wine.
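The combination step described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the code used for this report: `combine_wine` is a hypothetical helper, the two toy data frames stand in for the real files, and only the column name `X` for the row number is taken from the report itself.

```r
# Sketch: combine the red and white tables into one data frame.
# `combine_wine` is a hypothetical helper name, not part of the original analysis.
combine_wine <- function(red, white) {
  red$X <- NULL            # drop the row-number column (no effect if absent)
  white$X <- NULL
  red$type <- "red"        # label each row with its data set of origin
  white$type <- "white"
  rbind(red, white)        # both tables must share the same columns
}

# Toy stand-ins for the real data sets (values are illustrative only).
toy_red   <- data.frame(X = 1:2, alcohol = c(9.4, 9.8), quality = c(5L, 5L))
toy_white <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1),
                        quality = c(6L, 6L, 6L))

df_toy <- combine_wine(toy_red, toy_white)
str(df_toy)
```

With the real data, the two arguments would be the data frames returned by reading the downloaded CSV files.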
## [1] "dimension of red wine is : 1599 observations and 13 variables"
## [1] "dimension of white wine is : 4898 observations and 13 variables"
## [1] "dimension of complete data set is : 6497 observations and 13 variables"
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ type : chr "red" "red" "red" "red" ...
All of the variables have the desired class. Therefore, no further data wrangling is needed at this point.
## df_wine$type: red
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality type
## Min. :3.000 Length:1599
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## --------------------------------------------------------
## df_wine$type: white
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality type
## Min. :3.000 Length:4898
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The first summary statistics, grouped by wine type (red or white), indicate interesting differences. On average, red wine contains higher levels of fixed and volatile acidity, while white wine shows higher levels of citric acid. White wine has more residual sugar and slightly more alcohol, but a lower density. These variables might be related to each other, which will be explored later. The data also indicate that white wine has a higher quality on average than red wine.
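A grouped summary of this kind can be produced with base R. The sketch below uses a small toy data frame (illustrative values, not the real measurements) and shows both the `by()` form, which matches the layout of the output above, and a compact table of group means via `aggregate()`.

```r
# Toy data: illustrative values only, not taken from the wine data sets.
toy <- data.frame(
  type           = c("red", "red", "white", "white"),
  residual.sugar = c(1.9, 2.6, 5.2, 9.9),
  density        = c(0.998, 0.997, 0.994, 0.992)
)

# Full summary per wine type (same layout as the grouped output above).
print(by(toy[, c("residual.sugar", "density")], toy$type, summary))

# Compact alternative: one row of means per wine type.
group_means <- aggregate(cbind(residual.sugar, density) ~ type,
                         data = toy, FUN = mean)
print(group_means)
```

Putting the group means side by side makes it easier to check the direction of each difference between red and white wine.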
## df_wine$type: red
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## --------------------------------------------------------
## df_wine$type: white
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
The first chart reveals that the quality distributions of both wine types are slightly skewed to the right. It is consistent with the summary statistics: the maximum quality is 9 for white wine and 8 for red wine. However, a value of 9 is reached only very rarely (5 wines). The boxplot shows these data points as outliers, but they are kept since they are valid data points. The chosen binwidth also shows that all values are integers. The boxplot indicates that white wine has a slightly higher mean quality than red wine. Finally, the bar plot reveals that there are many more white wines (approx. 4900) than red wines (approx. 1600) in the data set. It would be interesting to look at this variable in relation to acidity, residual sugar, pH and alcohol.
## df_wine$type: red
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## --------------------------------------------------------
## df_wine$type: white
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH plot shows a right-skewed distribution of the pH values, independent of the wine type. Most data points lie between 3 and 3.5, while some reach slightly lower (min. 2.72) or higher (max. 4.01) values. All data points seem reasonable and indicate that all wines in the data set are acidic rather than basic. Let’s see what the distribution of fixed acidity looks like.
## df_wine$type: red
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## --------------------------------------------------------
## df_wine$type: white
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed acidity follows the distribution pattern of the other variables and is also skewed to the right. Most values lie between 4 and 8. The boxplot reveals that white wine normally has lower levels of fixed acidity, whereas red wine reaches higher levels. Even though there are some outliers, they still seem to be valid data points and are therefore not dropped from the data set. What pattern can we find in volatile acidity?
It is interesting to find the same pattern in volatile and fixed acidity. The histogram reveals that the variable is skewed to the right, whereas the boxplot reveals that red wine tends to reach higher levels of volatile acidity than white wine. The boxplot for white wine shows many outliers, which could be removed. Nevertheless, we will keep these data points for further analysis. Let’s have a look at the last type of acidity: citric acid.
Interestingly, this variable looks more normally distributed. Even more surprisingly, white wine tends to have higher levels of citric acid than red wine. The description of the variable may reveal the reason why:
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
Red wine is usually not known for a “refreshing” taste. This might be the reason why white wine contains higher levels of citric acid. Here as well, the outliers look like valid data points and are kept in the data set for further analysis.
The pH chart revealed some very interesting insights. It shows that all wines in the data set, independent of type, have values between ~2.7 and ~4.0. This means that all wines in the sample are acidic rather than basic. The author wonders whether this is the case for all wine or just a coincidence, but this question is not answered within this analysis. The chart further reveals that the distribution is very similar across the wine types in this regard.
The other acidity variables look mostly normally distributed or slightly skewed to the right. In general, red wine has higher levels of acidity, except for citric acid. Most values of fixed acidity lie between 4 and 12, of volatile acidity between 0.1 and 1, and of citric acid between 0 and 0.75.
Let’s move away from the acidity variables and have a deeper look at the sweetness and alcohol content of the wines.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 3253 7.9 0.330 0.28 31.6 0.053
## 3263 7.9 0.330 0.28 31.6 0.053
## 4381 7.8 0.965 0.60 65.8 0.074
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 3253 35 176 1.01030 3.15 0.38
## 3263 35 176 1.01030 3.15 0.38
## 4381 8 160 1.03898 3.39 0.69
## alcohol quality type
## 3253 8.8 6 white
## 3263 8.8 6 white
## 4381 11.7 6 white
## df_wine$type: red
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## --------------------------------------------------------
## df_wine$type: white
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.379 9.900 31.600
The first chart shows that values between 1 and 20 are very frequent for residual sugar, with one outlier above 60. This might be due to an entry error, as this data point is very far from the others. The second chart indicates that most wines have residual sugar in a range of 1 to 3; nevertheless, white wines more frequently have higher values than red wines. This is even more evident in the boxplot. Let’s jump to alcohol.
Note: Remove data points over 60 for further analysis.
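One way to apply this note is a simple row filter. The sketch below runs on a toy data frame with illustrative values; the threshold of 60 comes from the note above, everything else is an assumption made for the example.

```r
# Toy data with one extreme residual sugar value (illustrative only).
toy <- data.frame(
  residual.sugar = c(1.9, 31.6, 65.8, 5.2),
  type           = c("red", "white", "white", "white")
)

# Keep only rows at or below the threshold named in the note.
toy_clean <- subset(toy, residual.sugar <= 60)

nrow(toy) - nrow(toy_clean)   # number of rows dropped
max(toy_clean$residual.sugar) # largest remaining value
```

Counting the dropped rows is a cheap sanity check that the filter removed only the intended data point.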
The chart reveals a right-skewed distribution of alcohol. Most wines have an alcohol content of 9 to 13%, whereas only a few reach values below 8.4% and only a few contain more than 13% alcohol. This trend is similar for red and white wine.
The author created 3 data sets:
- df_red, containing only red wine samples (1599 observations, 13 variables)
- df_white, containing only white wine samples (4898 observations, 13 variables)
- df_wine, the aggregated data set with both samples (6497 observations, 13 variables)
The author would argue that there are 2 main features of interest in the data set. The ultimate interest lies in the quality variable. The author would like to determine which factors influence the quality of a wine and which do not. Furthermore, the author would like to explore the differences between the two wine samples and test whether the higher quality scores of white wine are statistically significant.
The author created the variable “type” in the aggregated data set and removed the original row number variable “X”. For now, this was everything. In the further course of the analysis, the author might create other variables, e.g. by cutting and labelling the fixed acidity variable.
Residual sugar had a very uncommon outlier with a value over 60. This data point was removed for the further analysis.
The pair plot showed some interesting insights. Our data set does not show any high correlations with quality. The highest correlation with quality is observed for alcohol (corr 0.444). The combined data set further highlights weaker correlations of quality with density (corr -0.312), volatile acidity (corr -0.266) and chlorides (corr -0.201). All other variables have a very weak correlation with quality (between -0.2 and 0.2). Disregarding quality, the highest correlations in the data set are found between:
density and fixed acidity (0.466)
total sulfur dioxide and volatile acidity (-0.415)
pH and citric acid (-0.33)
density and residual sugar (0.539)
sulphates and chlorides (0.396)
total sulfur dioxide and free sulfur dioxide (0.721)
alcohol and density (-0.701)
Interestingly, the data set reveals different insights when it is subset by the type of wine. The quality of white wine seems to be related mainly to volatile acidity (corr -0.195), chlorides (corr -0.21), density (corr -0.307) and alcohol (corr 0.436), while the quality of red wine also seems to be related to volatile acidity (corr -0.391), alcohol (corr 0.476) and density (corr -0.175), but in addition to citric acid (corr 0.226), total sulfur dioxide (corr -0.185) and sulphates (corr 0.251).
The correlation pair plots seem to indicate that wine type adds variance to the quality of a wine. However, all correlations tend to be weak rather than strong, and alcohol has the highest correlation with quality in all data sets.
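Per-type correlations like the ones quoted above can be computed without a pair plot. The following is a minimal sketch under stated assumptions: it uses Pearson correlation (the default of `cor()`, which may or may not match the method behind the pair plot), `cor_with_quality` is a hypothetical helper, and the data are randomly generated toy values.

```r
# Hypothetical helper: correlation of every numeric column with `quality`.
cor_with_quality <- function(df) {
  num <- df[sapply(df, is.numeric)]             # keep numeric columns only
  predictors <- setdiff(names(num), "quality")
  sapply(num[predictors], function(x) cor(x, num$quality))
}

# Randomly generated toy data (not the real wine measurements).
set.seed(1)
n <- 60
toy <- data.frame(
  type    = rep(c("red", "white"), each = n / 2),
  alcohol = runif(n, 8, 14)
)
toy$density <- 1.0 - 0.001 * toy$alcohol + rnorm(n, sd = 0.001)
toy$quality <- round(2 + 0.35 * toy$alcohol + rnorm(n, sd = 0.7))

# One named vector of correlations per wine type.
cor_by_type <- lapply(split(toy, toy$type), cor_with_quality)
print(cor_by_type)
```

Splitting first and correlating afterwards keeps the two wine types strictly separate, which is what allows the differences between red and white wine to show up.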
Let’s explore the variables that have higher correlations with quality in the combined data set.
The charts above show the relationship of quality with each variable that correlates more strongly with it. The jittered points show that the correlations tend to be rather small and not linear. Nevertheless, we can notice that the alcohol level decreases from quality 3 to 5 and then starts to increase with higher quality. Density, chlorides and volatile acidity show a lot of variance across quality levels, but overall it seems that wines with higher quality ratings are less dense and have lower levels of chlorides and volatile acidity.
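A trend across quality levels, such as the dip and rise in alcohol described above, can also be checked numerically by summarising the variable per quality level. The sketch below uses the median as the summary and a toy data frame whose values are purely illustrative.

```r
# Toy data: two wines per quality level (illustrative values only).
toy <- data.frame(
  quality = c(3L, 3L, 5L, 5L, 7L, 7L),
  alcohol = c(10.2, 10.6, 9.5, 9.9, 11.4, 11.8)
)

# Median alcohol content per quality level, as a named numeric vector.
alcohol_by_quality <- sapply(split(toy$alcohol, toy$quality), median)
print(alcohol_by_quality)
```

The median is used here because it is less sensitive to the outliers visible in the charts; the mean would work the same way by swapping the function.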
It seems overall surprising to see that alcohol is the only variable to show a clearer picture with its relationship with quality. But even more, it is interesting to see that alcohol seems to be the main driver for quality compared to the other variables.
Let’s have a look at how residual sugar and pH might influence the quality of a wine. One could think that the level of sweetness or sourness might influence the quality of a wine.
These charts further indicate that there is no direct relationship between the sweetness or pH level of a wine and its quality, which is consistent with the correlation levels. The results have been quite surprising so far, indicating that there is no clear factor that raises the quality of a wine (except for alcohol). A reason might be that individual tastes differ too much from each other, and a further subsetting of the data by test group (irregular, regular, experts, etc.) may be needed to reveal a clear pattern.
Since we are not able to find clear relationships with the quality variable, let’s investigate other relationships in the data set. Interestingly, the highest correlations in the combined data set are mostly related to the density of a wine. Residual sugar and fixed acidity show a positive correlation with density, while alcohol shows a negative correlation with density. We could probably try to build a model to predict the density of a wine, but this aspect is not of major interest for us. The relationship between volatile acidity and total sulfur dioxide also looks roughly linear, but is likewise not of major interest.
Let’s have a look at the relationships between pH and citric acid, sulphates and chlorides, and total and free sulfur dioxide. These correlations were among the highest found in the data set.
Citric acid tends to influence the pH value of a wine. It seems that higher levels of citric acid go along with a more acidic wine overall (lower pH values). The strongest relationship can be observed between total and free sulfur dioxide; it is the strongest in the whole data set. But since none of these variables correlates highly with the quality of a wine, they are not of major interest.
To sum up, no high correlations with quality were found in the data set. The strongest relationship with quality was found for alcohol, independent of the type of wine. Besides alcohol, the density of a wine seems to have the strongest relationship with quality. It was interesting to see that the separate wine types showed different correlations with the variables, indicating that wine type would add some variance in a comprehensive model. Residual sugar and pH seem not to be related to the perceived quality of a wine.
There are some relationships with medium to high level of correlation in the data set:
density and fixed acidity (0.466)
total sulfur dioxide and volatile acidity (-0.415)
pH and citric acid (-0.33)
density and residual sugar (0.539)
sulphates and chlorides (0.396)
total sulfur dioxide and free sulfur dioxide (0.721)
alcohol and density (-0.701)
It is questionable whether these relationships are of high interest; it probably depends on the interest group studying the topic. But for the author’s aim of determining the factors that influence the quality of a wine, these relationships, despite showing higher levels of correlation, are not of major importance.
The highest correlation with quality is observed for alcohol (corr 0.444). Among the other variables, a high correlation exists between total sulfur dioxide and free sulfur dioxide (0.721).
Let’s first have a look at the highest correlation levels with quality and subset the data by type.
This chart paints a very interesting picture. Red and white wine follow similar trends and also have similar levels of correlation (white: 0.44, red: 0.48). It seems that up to a medium level of quality (up to rating 5), the alcohol content is not of great importance, but the better the wine gets, the higher the alcohol level. Let’s see how density varies depending on type and quality.
Interestingly, the quality of red wine shows a much lower correlation with density (-0.175). The chart also seems to indicate the reason for the lower correlation, since the density range of white wine is much wider. When the density range of white wine is narrowed to values >= 0.990 and <= 1.005, the correlation weakens to -0.258. The chart further shows that red wine has a rather constant level of density (or just very small changes), independent of the level of quality. Nevertheless, the influence of the wine type is small. Let’s have a look at chlorides.
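The effect of narrowing the density range can be reproduced with a small helper that recomputes the correlation inside a given interval. This is a sketch only: `cor_in_range` is a hypothetical name, the bounds 0.990 and 1.005 are the ones mentioned above, and the toy values are illustrative.

```r
# Hypothetical helper: correlation of density and quality within [lo, hi].
cor_in_range <- function(df, lo, hi) {
  keep <- df$density >= lo & df$density <= hi
  cor(df$density[keep], df$quality[keep])
}

# Toy data with one unusually dense wine (illustrative values only).
toy <- data.frame(
  density = c(0.991, 0.993, 0.995, 0.997, 1.0103),
  quality = c(7, 6, 6, 5, 6)
)

cor_full   <- cor_in_range(toy, -Inf, Inf)     # all rows
cor_narrow <- cor_in_range(toy, 0.990, 1.005)  # extreme value excluded
c(full = cor_full, narrow = cor_narrow)
```

Comparing the two numbers shows how much a few extreme values contribute to the correlation, which is the question raised by the wider density range of white wine.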
We can observe in the plot that wines with higher quality tend to have lower levels of chlorides. This is somewhat more visible for white wine than for red. Nevertheless, the relationship seems to be rather weak, with a lot of variance within the variables.
Interestingly, the data set reveals different insights when it is subset by the type of wine. The quality of white wine seems to be related mainly to volatile acidity (corr -0.195), chlorides (corr -0.21), density (corr -0.307) and alcohol (corr 0.436), while the quality of red wine also seems to be related to volatile acidity (corr -0.391), alcohol (corr 0.476) and density (corr -0.175), but in addition to citric acid (corr 0.226), total sulfur dioxide (corr -0.185) and sulphates (corr 0.251).
This chart shows the relationship between quality and citric acid, separated by wine type. For white wine, hardly any trend is visible and many outliers can be observed. For red wine, the chart reveals that wines of higher quality tend to have higher levels of citric acid. Let’s have a look at sulphates.
This chart also highlights that the relationship between quality and sulphates differs between the wine types. While red wine tends to have higher levels of sulphates at higher quality, this does not seem to be the case for white wine.
Having analyzed the differences and relationships with quality by wine type, let’s have a closer look at the relationship of density with residual sugar and alcohol, as well as the relationship of sulphates and alcohol with quality.
The first plot reveals that wines with lower levels of residual sugar but higher levels of alcohol tend to have a lower density. In other words, wines with higher density tend to have high levels of residual sugar and lower levels of alcohol.
In the case of red wine, the chart indicates that wines with higher levels of sulphates and alcohol tend to achieve higher quality ratings, while this relation is less obvious for white wine. For white wine, the perceived quality seems to increase with higher alcohol content, but not necessarily with higher levels of sulphates.
##
## Calls:
## m1: lm(formula = I(quality ~ I(alcohol)), data = df_wine)
## m2: lm(formula = quality ~ I(alcohol) + density, data = df_wine)
## m3: lm(formula = quality ~ I(alcohol) + density + chlorides, data = df_wine)
## m4: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity,
## data = df_wine)
## m5: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity +
## type, data = df_wine)
## m6: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity +
## type + fixed.acidity, data = df_wine)
## m7: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity +
## type + fixed.acidity + pH, data = df_wine)
## m8: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity +
## type + fixed.acidity + pH + residual.sugar, data = df_wine)
## m9: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity +
## type + fixed.acidity + pH + residual.sugar + citric.acid,
## data = df_wine)
## m10: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity +
## type + fixed.acidity + pH + residual.sugar + citric.acid +
## sulphates, data = df_wine)
##
## ================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.405*** 2.494 -8.224 -37.526*** -27.468*** -37.649*** -38.263*** 93.917*** 93.524*** 134.476***
## (0.086) (4.678) (4.825) (4.801) (5.260) (5.739) (5.796) (14.833) (14.894) (15.383)
## I(alcohol) 0.325*** 0.325*** 0.325*** 0.385*** 0.365*** 0.380*** 0.382*** 0.244*** 0.245*** 0.193***
## (0.008) (0.011) (0.011) (0.011) (0.012) (0.012) (0.013) (0.019) (0.019) (0.020)
## density -0.088 10.827* 40.034*** 30.345*** 40.764*** 41.579*** -92.695*** -92.299*** -133.863***
## (4.618) (4.773) (4.752) (5.184) (5.691) (5.793) (15.031) (15.092) (15.590)
## chlorides -2.491*** -0.256 -0.792* -0.710* -0.737* -0.131 -0.114 -0.675*
## (0.296) (0.300) (0.321) (0.321) (0.323) (0.327) (0.333) (0.335)
## volatile.acidity -1.478*** -1.662*** -1.717*** -1.718*** -1.683*** -1.691*** -1.579***
## (0.063) (0.075) (0.076) (0.076) (0.075) (0.080) (0.080)
## type: white/red -0.157*** -0.199*** -0.211*** -0.513*** -0.510*** -0.494***
## (0.034) (0.035) (0.038) (0.049) (0.050) (0.050)
## fixed.acidity -0.040*** -0.044*** 0.073*** 0.074*** 0.103***
## (0.009) (0.011) (0.016) (0.016) (0.017)
## pH -0.055 0.516*** 0.513*** 0.591***
## (0.073) (0.093) (0.093) (0.093)
## residual.sugar 0.058*** 0.058*** 0.074***
## (0.006) (0.006) (0.006)
## citric.acid -0.024 -0.071
## (0.080) (0.079)
## sulphates 0.740***
## (0.076)
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.197 0.197 0.206 0.268 0.270 0.272 0.272 0.283 0.283 0.293
## adj. R-squared 0.197 0.197 0.206 0.267 0.269 0.272 0.271 0.282 0.282 0.292
## sigma 0.782 0.782 0.778 0.748 0.746 0.745 0.745 0.740 0.740 0.735
## F 1597.432 798.593 561.649 592.939 480.165 404.510 346.780 319.444 283.920 268.520
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7622.694 -7622.694 -7587.547 -7325.456 -7314.686 -7304.980 -7304.697 -7258.221 -7258.178 -7211.691
## Deviance 3975.690 3975.689 3932.899 3628.008 3615.999 3605.209 3604.895 3553.680 3553.632 3503.134
## AIC 15251.389 15253.388 15185.093 14662.911 14643.373 14625.960 14627.394 14536.443 14538.356 14447.383
## BIC 15271.726 15280.504 15218.988 14703.585 14690.825 14680.192 14688.404 14604.232 14612.924 14528.730
## N 6496 6496 6496 6496 6496 6496 6496 6496 6496 6496
## ================================================================================================================================================================
The first predictive model shows that the highest R-squared value we can achieve is 0.293, with 10 variables. This is a rather imprecise model with a lot of variables in it. Let’s try to reduce its complexity while keeping the precision (even though it is at a low level).
##
## Calls:
## m11: lm(formula = I(quality ~ I(alcohol)), data = df_wine)
## m12: lm(formula = quality ~ I(alcohol) + chlorides, data = df_wine)
## m13: lm(formula = quality ~ I(alcohol) + chlorides + volatile.acidity,
## data = df_wine)
## m14: lm(formula = quality ~ I(alcohol) + chlorides + volatile.acidity +
## type, data = df_wine)
## m15: lm(formula = quality ~ I(alcohol) + chlorides + volatile.acidity +
## type + residual.sugar, data = df_wine)
##
## ==========================================================================================
## m11 m12 m13 m14 m15
## ------------------------------------------------------------------------------------------
## (Intercept) 2.405*** 2.717*** 2.911*** 3.320*** 2.883***
## (0.086) (0.094) (0.091) (0.105) (0.113)
## I(alcohol) 0.325*** 0.308*** 0.319*** 0.313*** 0.349***
## (0.008) (0.008) (0.008) (0.008) (0.009)
## chlorides -2.308*** 0.161 -0.798* -0.601
## (0.285) (0.298) (0.322) (0.320)
## volatile.acidity -1.337*** -1.667*** -1.690***
## (0.061) (0.075) (0.074)
## type: white/red -0.236*** -0.326***
## (0.031) (0.032)
## residual.sugar 0.023***
## (0.002)
## ------------------------------------------------------------------------------------------
## R-squared 0.197 0.205 0.260 0.266 0.277
## adj. R-squared 0.197 0.205 0.259 0.266 0.277
## sigma 0.782 0.779 0.752 0.748 0.743
## F 1597.432 839.366 758.756 588.622 498.486
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7622.694 -7590.120 -7360.771 -7331.792 -7281.392
## Deviance 3975.690 3936.016 3667.670 3635.092 3579.121
## AIC 15251.389 15188.239 14731.541 14675.583 14576.783
## BIC 15271.726 15215.355 14765.436 14716.257 14624.236
## N 6496 6496 6496 6496 6496
## ==========================================================================================
The second predictive model reaches a highest R-squared value of 0.277 and is not much worse than the first one (R-squared: 0.293). It takes only 5 variables as input and is therefore much simpler than the first one, without losing much precision.
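The stepwise comparison of nested linear models can be sketched in base R as follows. This is an illustration under assumptions, not the code behind the tables above: the data are randomly generated, the model names `m_a` to `m_c` are hypothetical, and only three predictors are used to keep the example short.

```r
# Randomly generated toy data (not the real wine measurements).
set.seed(42)
n <- 200
toy <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0),
  type             = rep(c("red", "white"), length.out = n)
)
toy$quality <- 2.5 + 0.3 * toy$alcohol - 1.5 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Nested models: each one adds a single predictor to the previous one.
models <- list(
  m_a = lm(quality ~ alcohol, data = toy),
  m_b = lm(quality ~ alcohol + volatile.acidity, data = toy),
  m_c = lm(quality ~ alcohol + volatile.acidity + type, data = toy)
)

# R-squared never decreases when predictors are added, so AIC is shown too.
r2  <- sapply(models, function(m) summary(m)$r.squared)
aic <- sapply(models, AIC)
print(round(cbind(r2, aic), 3))
```

Because R-squared can only grow as variables are added, a criterion that penalises model size, such as AIC or BIC, is a more suitable guide when trading precision against simplicity.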
##
## Shapiro-Wilk normality test
##
## data: df_white$quality
## W = 0.88904, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: df_red$quality
## W = 0.85759, p-value < 2.2e-16
##
## Wilcoxon rank sum test with continuity correction
##
## data: df_wine$quality by df_wine$type
## W = 3311000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## red white
## 5.636023 5.877884
Since the Shapiro-Wilk normality test revealed that the quality ratings are not normally distributed, the quality of red and white wine was compared with the Wilcoxon rank sum test. It revealed a significant difference between the quality of red and white wine, suggesting that white wine tends to have a higher quality on average than red wine.
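The testing sequence described above (normality check per group, then a rank-based two-sample test) can be sketched with base R functions. The toy ratings below are illustrative only, and `exact = FALSE` is set as an assumption of this sketch to use the normal approximation, since integer ratings contain many ties.

```r
# Toy quality ratings (illustrative only, not the real data).
toy <- data.frame(
  quality = c(5, 5, 6, 5, 6, 4, 5, 6,
              6, 6, 7, 6, 5, 7, 6, 8, 6, 7),
  type    = rep(c("red", "white"), times = c(8, 10))
)

# Shapiro-Wilk p-value per wine type (small values speak against normality).
sw_p <- tapply(toy$quality, toy$type, function(q) shapiro.test(q)$p.value)

# Wilcoxon rank sum test comparing the two groups.
wt <- wilcox.test(quality ~ type, data = toy, exact = FALSE)

# Group means, to see the direction of the difference.
group_mean <- tapply(toy$quality, toy$type, mean)

print(sw_p)
print(wt)
print(group_mean)
```

The rank sum test only indicates that the two distributions differ in location; the group means are printed alongside it to show which wine type is rated higher.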
This section tried to dig deeper into the commonalities and differences of red and white wine. It was interesting to see the different variables influencing quality separated by wine type. It is especially interesting to see how the quality of red and white wine responds to alcohol, which seems to be the only variable that really drives the quality of wine. All other variables (density, chlorides, citric acid and residual sugar) seem to have a weaker influence on quality, which is consistent with the low overall correlation levels found in the previous section. An interesting observation was made by plotting sulphates against alcohol, faceting by type and coloring by quality score. It seems that red wines with higher levels of alcohol and sulphates receive higher quality scores, whereas this relationship is not clearly visible for white wine.
There is an interesting relationship between residual sugar, alcohol and density. Denser wines with lower alcohol levels tend to have higher levels of residual sugar. But even in this relationship a clear conclusion or pattern is hard to identify.
The author tried to create a model to predict the quality of a wine based on several inputs. Overall, the model is not very precise, as the R-squared value peaks at 0.293. This reflects the fact that only low levels of correlation can be found in the data set. Consequently, predicting the quality of a wine with the existing variables is rather hard.
The best model in terms of the trade-off between number of variables and effectiveness can be built with the variables chlorides, volatile acidity, type and residual sugar.
The author further tested whether the difference in average quality between the wine types is significant. A first test indicated that the quality variable is not normally distributed, and therefore a Wilcoxon rank sum test was applied. The result shows that white wine receives on average better quality ratings than red wine.
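To make the comparison between a larger and a smaller model concrete, here is a minimal sketch using `lm` on synthetic data. The column names follow the variable list of the data set, but the generated values, the coefficients used to simulate `quality`, and the choice of which columns go into the larger model are assumptions for illustration and do not reproduce the author's models.

```r
# Synthetic data with a handful of the wine variables (illustrative values only).
set.seed(1)
n <- 500
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  chlorides        = rnorm(n, mean = 0.056, sd = 0.02),
  volatile.acidity = rnorm(n, mean = 0.34, sd = 0.15),
  residual.sugar   = rnorm(n, mean = 5.4, sd = 3),
  pH               = rnorm(n, mean = 3.2, sd = 0.16),
  type             = factor(sample(c("red", "white"), n, replace = TRUE))
)
# Simulated response: an assumed linear signal plus noise.
wine$quality <- 2.5 + 0.3 * wine$alcohol - 1.2 * wine$volatile.acidity +
  rnorm(n, mean = 0, sd = 0.75)

# Larger model: every available column as a predictor.
full <- lm(quality ~ ., data = wine)

# Smaller model: only the variables named in the text above.
reduced <- lm(quality ~ chlorides + volatile.acidity + type + residual.sugar,
              data = wine)

# In-sample R-squared never decreases when predictors are added, so the
# adjusted R-squared is the fairer figure when weighing simplicity against fit.
comparison <- data.frame(
  model         = c("full", "reduced"),
  r.squared     = c(summary(full)$r.squared, summary(reduced)$r.squared),
  adj.r.squared = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
)
print(comparison)
```

How much fit is lost by dropping variables depends entirely on the data, so the numbers printed here say nothing about the 0.293 and 0.277 reported for the real models.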
The first plot shows the distribution of quality ratings split by red and white wine. One can see that both wine types are rated in a similar manner. The interquartile range is 1 for both wine types, showing that most quality ratings are between 5 and 6. The lowest rating received was 3 for both wine types, while white wine received a maximum rating of 9 compared with a maximum of 8 for red wine. Although the plot shows similar figures, one can see that the average rating of white wine is higher than that of red wine. The Shapiro-Wilk test revealed that the quality ratings of both wine types are not normally distributed. Furthermore, the Wilcoxon rank sum test indicated a statistically significant difference between the quality ratings of red and white wine.
Correlation tests indicated that none of the 3 data sets contains a very high correlation (beyond +/- 0.75) with the quality variable. The highest correlation could be observed with alcohol. This relationship is also visible in the 2nd plot, which shows that the perceived quality of wines, independently of their type, tends to be better with higher alcohol content. It also visualizes why similar levels of correlation can be found among the different data sets (red: 0.436, white: 0.476).
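A per-type correlation like the one quoted above can be computed along these lines. The sketch uses synthetic data and Pearson correlation, the default of `cor` and `cor.test`; the source does not state which method was used, so that choice and all simulated values are assumptions.

```r
# Synthetic alcohol and quality values for both wine types (illustrative only).
set.seed(7)
n <- 300
wine <- data.frame(
  type    = rep(c("red", "white"), each = n),
  alcohol = rnorm(2 * n, mean = 10.5, sd = 1.1)
)
wine$quality <- 1 + 0.45 * wine$alcohol + rnorm(2 * n, mean = 0, sd = 0.9)

# Correlation between alcohol and quality, computed separately per type.
cor_by_type <- sapply(split(wine, wine$type),
                      function(d) cor(d$alcohol, d$quality))

# cor.test additionally reports a p-value and a confidence interval.
test_red <- cor.test(~ alcohol + quality, data = subset(wine, type == "red"))

print(cor_by_type)
print(test_red$conf.int)
```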
Plot 3 reveals interesting insights into the relationship between alcohol content and sulphates, colored by wine quality and separated by wine type. In the case of red wine, it indicates that higher levels of sulphates and alcohol tend to achieve higher quality ratings, while this relationship is less obvious for white wine. For white wine, the perceived quality seems to increase with higher alcohol content, but not necessarily with higher levels of sulphates.
The EDA task was a very interesting project for the author, who tried to gain insights into the chemical substances of wine and their influence on quality. Although the author was very excited at the beginning of the project, he had to deal with some obstacles and disappointments. It was nice to gain an understanding of the single variables of the data set, but as soon as the variables were plotted against each other, it was rather disappointing to see low levels of correlation, especially regarding quality. At that point the author assumed that it was not going to be easy to create a good predictive model for the quality of wine. Nevertheless, it was an interesting overall project for the author to apply his new skills to a data set with only little guidance from the template. On the other hand, this freedom makes it hard to decide when the end point of the analysis has been reached, as the analysis could go on forever. The author found it helpful to set a clear goal: to understand primarily the influences on quality, rather than the influence of the chemical substances on each other.
This aspect could certainly offer possibilities for future research. On the other hand, the analysis shows that other data or metrics are needed to understand how an individual perceives the quality of a wine, and thus to build better models. It might also be interesting to add the grape variety or the level of wine experience of the individual tasters.
Overall, the project was a great experience.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib